{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2f01b288",
   "metadata": {},
   "source": [
    "![image](img/librarian_bot.png)\n",
    "\n",
    "# RAG bot for investor information\n",
    "\n",
    "Exercise for week 5 of LLM Engineering course\n",
    "\n",
    "Specialized bot focusing on analysing financial documents from Investor Relations webpages. \n",
    "Comes together with a web crawler spider to gather documents quickly.\n",
    "\n",
    "This notebook will create a personal RAG bot. It will use a the ./kb directory to store the files that we want to include in the RAG. Subdirectories will be used to denote categories for the files.\n",
    "**Important: only one level of subdirectories will be used for the categories**\n",
    "\n",
    "It uses LangChain to create and process the RAG pipeline and chat.\n",
    "The vector database persistent sotre is in the ./vdb folder. \n",
    "\n",
    "In this version we use chromadb for the vector store.\n",
    "The store is recreated each run. This is not efficient for large datasets. \n",
    "\n",
    "Future upgrades - To Do (in no particular order): \n",
    "- [x] Create a fully local version for security and privacy (*see v01_local*) <span style=\"color:orange\">\n",
    "        NOTE: will require a fairly advanced LLM to answer questions without losing context. 2-4bn parameters LLM's struggle and tend to hallucinate. Best options are gpt-4o-mini and claude-3.5-haiku.</span>\n",
    "- [x] Fine tune the pdf scraper to handle financial reports better\n",
    "- [x] Create custom retriever for financial information\n",
    "- [ ] Create persistent data store between runs - only load, chunk and embed changed documents. \n",
    "- [ ] Provide selection of vector db engines (Chroma DB as default, or connect to external vector db e.g. ElasticSearch or AWS Opensearch)\n",
    "- [ ] Add an interface to upload documents in data store - including user-defined metadata tags\n",
    "- [ ] Multimodality: Process more document data types (e.g. ppt) \n",
    "- [x] Add online search capability - use web crawler tool to crawl a website and create website-specific RAG bot\n",
    "- [ ] Read e-mails/calendars/online docs (Amazon S3 bucket, Google Drive)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dfe8e48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# These were necessary as langchain does not install them by default\n",
    "# !pip install pypdf\n",
    "# !pip install pdfminer.six\n",
    "# !pip install python-docx\n",
    "!pip install docx2txt\n",
    "!pip install pymupdf4llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "193171c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import glob\n",
    "from dotenv import load_dotenv\n",
    "import gradio as gr\n",
    "\n",
    "# imports for langchain, plotly and Chroma\n",
    "# plotly is commented out, as it is not used in the current code\n",
    "\n",
    "from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader, Docx2txtLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "# from langchain.schema import Document\n",
    "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
    "from langchain_chroma import Chroma\n",
    "#import matplotlib.pyplot as plt\n",
    "#from sklearn.manifold import TSNE\n",
    "#import numpy as np\n",
    "#import plotly.graph_objects as go\n",
    "from langchain.memory import ConversationBufferMemory\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "from langchain.embeddings import HuggingFaceEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d22d2e48",
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL = \"gpt-4o-mini\"\n",
    "db_name = \"vdb\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc23bf8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv(override=True)\n",
    "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0103ef35",
   "metadata": {},
   "source": [
    "## Loading the documents\n",
    "\n",
    "In the code below we read in the KB documents and create the vector store. \n",
    "We will be adding PDF documents, Word documents and text/markdown documents. \n",
    "Each document has its own loader, which we are calling separately through DirectoryLoader.\n",
    "For PDF we implement custom loader to manage financial data. \n",
    "\n",
    "At the end, we are combining the results, and then start splitting the documents using the Recursive Character Text Splitter.\n",
    "\n",
    "This approach is not optimal for financial tables.\n",
    "TO DO:\n",
    " - [x] Replace splitter with better technique that preserves tables.\n",
    " - [x] Replace PDF Reader with pymupdf4llm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "918cbbf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Utility functions for EU financial reporting (read from PDF)\n",
    "# We're using pymupdf4llm for better handling of financial reports\n",
    "# This function does not utilize a loader class, but directly processes the PDF file\n",
    "# It extracts financial sections and returns them as Document objects\\\n",
    "\n",
    "import pymupdf4llm\n",
    "from langchain.schema import Document\n",
    "import re\n",
    "import string\n",
    "from pathlib import Path\n",
    "\n",
    "def extract_eu_financial_reports(pdf_path):\n",
    "    \"\"\"\n",
    "    Extracts financial sections from an EU financial report PDF using pymupdf4llm.\n",
    "\n",
    "    Args:\n",
    "        pdf_path (str): Path to the PDF file.\n",
    "\n",
    "    Returns:\n",
    "        list[Document]: A list of LangChain Document objects, each representing a detected financial section\n",
    "                        (e.g., income statement, balance sheet, cash flow statement, etc.) with associated metadata.\n",
    "\n",
    "    The function processes the PDF, detects section headers based on common financial report section names,\n",
    "    and splits the content accordingly. Each Document contains the section text and metadata including section name,\n",
    "    content type, source file, and page range.\n",
    "    \"\"\"\n",
    "    md_text = pymupdf4llm.to_markdown(\n",
    "        pdf_path,\n",
    "        page_chunks=True,  # Preserve page boundaries\n",
    "        write_images=False,\n",
    "        embed_images=False\n",
    "    )\n",
    "    \n",
    "    # EU financial reports have predictable structures\n",
    "    financial_sections = [\n",
    "        \"consolidated income statement\", \"profit and loss\", \"p&l\", \"remuneration report\",\n",
    "        \"balance sheet\", \"cash flow statement\", \"statement of financial position\",\n",
    "        \"notes to the consolidated financial statements\", \"segment reporting\",\n",
    "        \"risk management\", \"capital adequacy\", \"basel\", \"ifrs\", \"regulatory capital\"\n",
    "    ]\n",
    "    \n",
    "    documents = []\n",
    "    current_section = None\n",
    "    current_content = \"\"\n",
    "    start_page = 1\n",
    "    \n",
    "    for page_dict in md_text:\n",
    "        # Extract the actual text content from the dictionary\n",
    "        page_content = page_dict.get(\"text\", \"\")\n",
    "        page_num = page_dict.get(\"page\", start_page)\n",
    "\n",
    "        # Detect financial section headers\n",
    "        content_lower = page_content.lower()\n",
    "        detected_section = None\n",
    "        \n",
    "        for section in financial_sections:\n",
    "            if section in content_lower:\n",
    "                detected_section = section\n",
    "                break\n",
    "        \n",
    "        # Process section changes\n",
    "        if detected_section and detected_section != current_section:\n",
    "            if current_content:\n",
    "                # Save previous section\n",
    "                documents.append(Document(\n",
    "                    page_content=current_content.strip(),\n",
    "                    metadata={\n",
    "                        \"content_type\": \"financial_statement\",\n",
    "                        \"section\": current_section or \"general\",\n",
    "                        \"source\": pdf_path,\n",
    "                        \"pages\": f\"{start_page}-{page_num-1}\"\n",
    "                    }\n",
    "                ))\n",
    "            current_section = detected_section\n",
    "            current_content = page_content\n",
    "        else:\n",
    "            current_content += \"\\n---\\n\" + page_content\n",
    "    \n",
    "    # Handle final section\n",
    "    if current_content:\n",
    "        documents.append(Document(\n",
    "            page_content=current_content.strip(),\n",
    "            metadata={\n",
    "                \"content_type\": \"financial_statement\",\n",
    "                \"section\": current_section or \"general\",\n",
    "                \"source\": pdf_path,\n",
    "                \"pages\": f\"{start_page}-{page_num}\"\n",
    "            }\n",
    "        ))\n",
    "    \n",
    "    return documents\n",
    "\n",
    "# Utility functions for loading documents from a folder\n",
    "def load_eu_financial_reports_from_directory(directory_path: str, glob_pattern: str = \"*.pdf\"):\n",
    "    \"\"\"\n",
    "    Load and process all EU financial reports from a directory.\n",
    "\n",
    "    Args:\n",
    "        directory_path (str): Path to the directory containing PDF files\n",
    "        glob_pattern (str, optional): Pattern to match PDF files. Defaults to \"*.pdf\"\n",
    "\n",
    "    Returns:\n",
    "        list[Document]: A list of LangChain Document objects containing the extracted financial sections\n",
    "                       from all successfully processed PDFs in the directory.\n",
    "\n",
    "    The function iterates through PDF files in the specified directory that match the glob pattern,\n",
    "    processes each file using extract_eu_financial_reports(), and combines the results into a single list.\n",
    "    Files that cannot be processed are skipped with an error message printed to stdout.\n",
    "    \"\"\"\n",
    "    all_documents = []\n",
    "    directory = Path(directory_path)\n",
    "    \n",
    "    for pdf_file in directory.glob(glob_pattern):\n",
    "        try:\n",
    "            documents = extract_eu_financial_reports(str(pdf_file))\n",
    "            all_documents.extend(documents)\n",
    "        except Exception as e:\n",
    "            print(f\"Error processing {pdf_file}: {e}\")\n",
    "    \n",
    "    return all_documents\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f20fd20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read in documents using LangChain's loaders\n",
    "# Take everything in all the sub-folders of our knowledgebase\n",
    "\n",
    "folders = glob.glob(\"kb/*\")\n",
    "print(f\"Found {len(folders)} folders in the knowledge base.\")\n",
    "\n",
    "def add_metadata(doc, doc_type):\n",
    "    doc.metadata[\"doc_type\"] = doc_type\n",
    "    return doc\n",
    "\n",
    "# For text files\n",
    "text_loader_kwargs = {'encoding': 'utf-8'}\n",
    "\n",
    "documents = []\n",
    "for folder in folders:\n",
    "    print(f\"Loading documents from folder: {folder}\")\n",
    "    doc_type = os.path.basename(folder)\n",
    "    # PDF Loader\n",
    "    # We're not using the PDFMinerLoader as it does not handle EU financial reports well.\n",
    "    # Instead, we use our custom extract_eu_financial_reports function.\n",
    "    # Uncomment the next line if you want to use the standard loader for PDF files\n",
    "    # pdf_loader = DirectoryLoader(folder, glob=\"**/*.pdf\", loader_cls=extract_eu_financial_reports)\n",
    "    # Text loaders\n",
    "    txt_loader = DirectoryLoader(folder, glob=\"**/*.txt\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)\n",
    "    md_loader = DirectoryLoader(folder, glob=\"**/*.md\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)\n",
    "    # Load MS Word documents - UnstructuredWordDocumentLoader does not play well with numpy > 1.24.0, and we use Docx2txtLoader instead. \n",
    "    # doc_loader = DirectoryLoader(folder, glob=\"**/*.doc\", loader_cls=UnstructuredWordDocumentLoader)\n",
    "    docx_loader = DirectoryLoader(folder, glob=\"**/*.docx\", loader_cls=Docx2txtLoader)\n",
    "    # document doc_type is used to identify the type of document\n",
    "    # Load documents from PDF, text and word files and combine the results\n",
    "    pdf_docs = load_eu_financial_reports_from_directory(folder)\n",
    "    print(f\"Loaded {len(pdf_docs)} PDF documents from {folder}\")\n",
    "    text_docs = txt_loader.load() + md_loader.load()\n",
    "    print(f\"Loaded {len(text_docs)} text documents from {folder}\")\n",
    "    word_docs = docx_loader.load()\n",
    "    print(f\"Loaded {len(word_docs)} Word documents from {folder}\")\n",
    "    folder_docs = pdf_docs + text_docs + word_docs\n",
    "    # Add metadata to each document\n",
    "    if not folder_docs:\n",
    "        print(f\"No documents found in folder: {folder}\")\n",
    "        continue\n",
    "    documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])\n",
    "\n",
    "# Split the documents into chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
    "chunks = text_splitter.split_documents(documents)\n",
    "\n",
    "# Print out some basic info for the loaded documents and chunks\n",
    "print(f\"Total number of documents: {len(documents)}\")\n",
    "print(f\"Total number of chunks: {len(chunks)}\")\n",
    "print(f\"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "749ad5d8",
   "metadata": {},
   "source": [
    "## Vector Store\n",
    "\n",
    "We use Chromadb for vector store.\n",
    "\n",
    "Same code as the one in the lesson notebook, minus the visualization part\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efc70e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers\n",
    "# Then replace embeddings = OpenAIEmbeddings()\n",
    "# with:\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "embeddings = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-mpnet-base-v2\") # A bit slower, but better than all-MiniLM-L6-v2 for financial documents\n",
    "\n",
    "# Delete if already exists\n",
    "\n",
    "if os.path.exists(db_name):\n",
    "    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n",
    "\n",
    "# Create vectorstore\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n",
    "print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")\n",
    "\n",
    "# Let's investigate the vectors\n",
    "\n",
    "collection = vectorstore._collection\n",
    "count = collection.count()\n",
    "\n",
    "sample_embedding = collection.get(limit=1, include=[\"embeddings\"])[\"embeddings\"][0]\n",
    "dimensions = len(sample_embedding)\n",
    "print(f\"There are {count:,} vectors with {dimensions:,} dimensions in the vector store\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9af1d32",
   "metadata": {},
   "source": [
    "## LangChain\n",
    "Create Langchain chat, memory and retrievers.\n",
    "\n",
    "Trying a number of LLM's for ollama. They are not very good at sortingo out the relevant information from financial documents - they do provide results, but tend to be overly chatty and especially the specific numbers can be hallucinated or taken out of context. \n",
    "\n",
    "GPT-4o-mini provided much more accurate answers to specific questions, even with huggingface's embeddings for the vector store. \n",
    "\n",
    "Implemented (with Claude's help) a custom retriever and prompt to focus on financial statement analysis.\n",
    "\n",
    "### OpenAI rate limits\n",
    "*Note*: If using OpenAI's embeddings, there's a limit of 300K tokens per request. This requires special handling when calling Chroma.from_documents.\n",
    "###TO DO:\n",
    "- [ ] Add rate limiter for encoding documents and encode in batches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59f75e5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specialized Retriever for consolidated financials\n",
    "\n",
    "from langchain.schema import BaseRetriever, Document\n",
    "from typing import List\n",
    "\n",
    "from langchain.vectorstores.base import VectorStoreRetriever\n",
    "\n",
    "class EUFinancialRetriever(VectorStoreRetriever):\n",
    "    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:\n",
    "        query_lower = query.lower()\n",
    "        k = self.search_kwargs.get(\"k\", 5)\n",
    "        \n",
    "        # Section-aware search logic\n",
    "        section_queries = {\n",
    "            'income': ['income', 'revenue', 'profit', 'earnings'],\n",
    "            'balance': ['balance', 'assets', 'liabilities', 'equity'],\n",
    "            'cash': ['cash flow', 'operating cash', 'free cash']\n",
    "        }\n",
    "        \n",
    "        for section, terms in section_queries.items():\n",
    "            if any(term in query_lower for term in terms):\n",
    "                try:\n",
    "                    return self.vectorstore.similarity_search(\n",
    "                        query, k=k, filter={\"section\": section}\n",
    "                    )\n",
    "                except:\n",
    "                    break\n",
    "        \n",
    "        # Fallback to standard search\n",
    "        return self.vectorstore.similarity_search(query, k=k)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aca30d15",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specialized prompt for the retriever\n",
    "\n",
    "financial_prompt = \"\"\"\n",
    "You are analyzing EU bank and corporate financial statements. When answering:\n",
    "\n",
    "1. For numerical data, ALWAYS cite the specific financial statement section\n",
    "2. Consider regulatory context (IFRS, Basel III for banks)\n",
    "3. Note if data spans multiple periods or segments\n",
    "4. Highlight any footnotes or adjustments mentioned\n",
    "5. Be precise about currency and units (EUR millions, thousands, etc.)\n",
    "\n",
    "Context from financial statements:\n",
    "{context}\n",
    "\n",
    "Question: {question}\n",
    "\n",
    "Answer:\n",
    "\"\"\"\n",
    "# Updated chain with financial-aware prompt\n",
    "from langchain.prompts import PromptTemplate\n",
    "\n",
    "prompt = PromptTemplate(\n",
    "    input_variables=[\"context\", \"question\"],\n",
    "    template=financial_prompt\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2360006e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a new Chat with OpenAI\n",
    "llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
    "\n",
    "# Alternative - if you'd like to use Ollama locally, uncomment this line instead\n",
    "#llm = ChatOpenAI(temperature=0.7, model_name='gemma3:4b', base_url='http://localhost:11434/v1', api_key='ollama')\n",
    "\n",
    "# set up the conversation memory for the chat\n",
    "memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
    "\n",
    "# the retriever is an abstraction over the VectorStore that will be used during RAG\n",
    "retriever = EUFinancialRetriever(\n",
    "    vectorstore=vectorstore, \n",
    "    search_kwargs={\"k\": 5}\n",
    ")\n",
    "# putting it together: set up the conversation chain with the GPT 3.5 LLM, the vector store and memory\n",
    "conversation_chain = ConversationalRetrievalChain.from_llm(\n",
    "    llm=llm, \n",
    "    retriever=retriever, \n",
    "    memory=memory, \n",
    "    combine_docs_chain_kwargs={\"prompt\": prompt},\n",
    "    return_source_documents=False\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a21bb3",
   "metadata": {},
   "source": [
    "## UI part\n",
    "Create Gradio interface\n",
    "\n",
    "Simple built-in chat interface\n",
    "\n",
    "###To Do: \n",
    "- [ ] Add model selector for Claude 3.5 Haiku\n",
    "- [ ] Update interface to handle sources (with **return_source_documents=True**)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dfe7d75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wrapping that in a function\n",
    "\n",
    "def chat(question, history):\n",
    "    result = conversation_chain.invoke({\"question\": question})\n",
    "    return result[\"answer\"]\n",
    "\n",
    "# And in Gradio:\n",
    "\n",
    "view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llms",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
